Rare copy-number variants as modulators of common disease susceptibility

Background Copy-number variations (CNVs) have been associated with rare and debilitating genomic disorders (GDs) but their impact on health later in life in the general population remains poorly described. Methods Assessing four modes of CNV action, we performed genome-wide association scans (GWASs) between the copy-number of CNV-proxy probes and 60 curated ICD-10 based clinical diagnoses in 331,522 unrelated white British UK Biobank (UKBB) participants with replication in the Estonian Biobank. Results We identified 73 signals involving 40 diseases, all of which indicating that CNVs increased disease risk and caused earlier onset. We estimated that 16% of these associations are indirect, acting by increasing body mass index (BMI). Signals mapped to 45 unique, non-overlapping regions, nine of which being linked to known GDs. Number and identity of genes affected by CNVs modulated their pathogenicity, with many associations being supported by colocalization with both common and rare single-nucleotide variant association signals. Dissection of association signals provided insights into the epidemiology of known gene-disease pairs (e.g., deletions in BRCA1 and LDLR increased risk for ovarian cancer and ischemic heart disease, respectively), clarified dosage mechanisms of action (e.g., both increased and decreased dosage of 17q12 impacted renal health), and identified putative causal genes (e.g., ABCC6 for kidney stones). Characterization of the pleiotropic pathological consequences of recurrent CNVs at 15q13, 16p13.11, 16p12.2, and 22q11.2 in adulthood indicated variable expressivity of these regions and the involvement of multiple genes. Finally, we show that while the total burden of rare CNVs—and especially deletions—strongly associated with disease risk, it only accounted for ~ 0.02% of the UKBB disease burden. These associations are mainly driven by CNVs at known GD CNV regions, whose pleiotropic effect on common diseases was broader than anticipated by our CNV-GWAS. Conclusions Our results shed light on the prominent role of rare CNVs in determining common disease susceptibility within the general population and provide actionable insights for anticipating later-onset comorbidities in carriers of recurrent CNVs. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-023-01265-5.


Note S1: Microarray-based CNV calling
Chromosome X CNVs Chromosome X CNVs were called using dedicated PennCNV modalities as previously described [1].To avoid interference between the two-letter CNV encoding (Note S1 -Table 1) and the male chromosome X hemizygosity assumption of PLINK, all individuals are (falsely) labeled as female when performing genetic analyses in PLINK.

Sample CNV quality-control
Samples on genotyping plates with a mean CNV count per sample > 100 and samples with > 200 CNVs or single CNV > 10 Mb were excluded, as these might stem from artifactual CNVs consecutive of poor-quality genotyping or extreme CNV events (e.g., aneuploidies or chromothripsis) with potentially extreme phenotypic consequences that are out of the scope of this study.

CNV encoding in PLINK
CNV matrices were encoded into three PLINK binary file sets (--make-bed PLINK v1.9;Note S1 -Table 1) as previously described [1].To reduce file size and facilitate parallelized computation, files are split at the chromosome level (i.e., for each PLINK encoding there are 24 files: 22 autosomes + pseudoautosomal regions + chromosome X).PLINK file sets were used to fit four association models mimicking different modes of CNV action: mirror, U-shape, duplication-only, or deletion-only (Main CNV-GWAS model).

Note S1 -Table 1. Encoding of CNVs
Encoding of high-confidence CNVs (|QS| > 0.5) into numerical probe-by-sample matrices (Num.) and three PLINK file set (PLINK), mimicking encoding of single-nucleotide variants.*The U-shape model uses the same PLINK file set as the mirror model but is assessed with the hetonly modifier of PLINK's glm function, allowing to compare the effect of deletion and duplications against copy-neutral individuals.QS = quality score.

Note S2: Sample filtering criteria
Starting from 488,377 samples, we applied several filters to obtain our final set of 331,522 individuals included for the CNV-GWAS analysis (Note S2 -Table 1).

STEP Filter Description Nexcluded Nremaining
START 488,377 1 Relatedness Samples were excluded if they were set to 0 in "used.in.pca.calculation"(i.e., not used for principal component analysis (PCA) calculation) in the sample QC file (v2) described in UKBB resource 531.PCA were calculated based on unrelated individuals (KING software [2] --related --degree 3), with missing rate on autosomes ≤ 0.02, and no mismatch between inferred and self-reported sex [3].Focusing on unrelated individuals allows to prevent p-value deflation due to correlated residual noise.

81,158 407,219 2 Ancestry
Only sample with "in.white.British.ancestry.subset"set to 1 (i.e., self-identify as "White British" and cluster with that group based on SNP PCA analysis [3]) in the sample QC file (v2) described in UKBB resource 531 were retained.We refer to this subgroup as "white British" for the remainder of the study.This allows to obtain a sample with homogenous genetic ancestry.

69,674 337,545 3 Retracted
Samples that were redacted or retracted their participation at the time the project was initiated (August 2020) were excluded.1. Summary of sample filtering procedure List of filters applied to generate the set of 331,522 individuals used for CNV-GWAS analysis."STEP" indicates the order in which filters were applied, with description of the exact criteria and rationale in "Description".For each step, the number of excluded individuals (Nexcluded), along with the number of remaining individuals after applying the filter (Nremaining) is indicated.

Note S3: Probe and covariate selection for main GWAS analysis
Relevant covariates and probes were pre-selected to fit tailored main CNV genome-wide association scan (GWAS) models and reduce computation time.

Covariate selection
For each disease, a logistic regression was fitted to explain disease probability as a function of age (#21003), sex (self-reported + genetically confirmed), genotyping array, and the 40 first principal components (PCs) from the single nucleotide polymorphism (SNP) genotyping data.
Nominally significantly associated covariates (p ≤ 0.05) were retained for the main GWAS.
Number of retained covariates ranged between two (sarcoidosis and multiple sclerosis) and 24 (hypertension, arthrosis, and disease burden) (Note S3 -Figure 1A) and correlated with case number of the disease, aligned with the expected gain in power for more frequent diseases (Note S3 -Figure 1B).Covariates used for the main CNV-GWAS are listed in Additional file 2: Table S2.

Probe pruning based on copy-number status correlation
The 70,631 probes with a CNV frequency ≥ 0.01% were pruned at r 2 > 0.9999 in PLINKCNV (--indep-pairwise 500 250 0.9999 PLINK v2.0), based on their CNV genotype, resulting in 18,725 probes.Pruning at such a high threshold will retain only a single probe at the core of a CNV region, where due to the recurrent nature of CNVs the correlation is extremely high.
However, it will retain multiple probes around the CNV breakpoints (BPs), where we expect variability due to true biological variation or uncertainty of the CNV calling algorithm.
Note S3 -Figure 1. Covariate selection and probe filtering (A) Left: Dark gray tiles indicate covariates (x-axis) retained for the corresponding disease and/or disease burden (y-axis) (nominal significant association).PC = principal component.Right: Number of retained covariates per disease.(B) Logarithm of number of selected covariates (y-axis) against the logarithm of number of cases (x-axis) for each of the 60 assessed diseases.Linear regression equations with 95% confidence intervals are displayed.(C) Number of probes retained after filtering (x-axis) for the mirror and U-shape (left), duplication-only (middle), and deletion-only (right) models for each of the 60 investigated diseases and the disease burden (y-axis).(D) Logarithm of number of selected probes (y-axis) against the logarithm of number of cases (x-axis) for each of the 60 assessed diseases, split by association model.Linear regression equations with 95% confidence intervals are displayed.

Probe selection
For each disease, 2-by-3 genotypic Fisher tests assessed dependence between disease status and probe copy-number (rows: control versus case; columns: deletion versus copyneutral versus duplication; --model fisher PLINK v1.9; TEST column GENO).For each phenotype, QQ plots were generated by plotting the observed against the expected negative logarithm of the Fisher's test p-value.For the disease burden, p-values from linear regression were used instead.The genomic inflation factor, l, was calculated as the median of the chisquared test statistics derived from the Fisher's tests p-values divided by the expected median of the chi-squared distribution.Overall, there was no sign of strong p-value inflation (Note S3 -Figure 2A).l values above 1.1 indicate genomic inflation, which can be caused by population structure, linkage disequilibrium, or polygenicity [5] and was observed only for 6 highly polygenic traits, with a maximum value of 1.33 for the disease burden.On the other hand, 42 traits exhibited l values below 0.9.Deflated p-values can be caused by extremely rare variants.To verify this hypothesis, l values were calculated anew, excluding probes with the 5-80% lowest CNV frequency (in incremental steps of 5%), to determine the impact of CNV frequency on genomic factor deflation (Note S3 -Figure 2B).We observed a trend of increasing l values when excluding low frequency probes, indicating that the deflation is indeed caused by probes with low CNV frequency.Importantly, low l values do not increase false positive rates.l values are available in Additional file 2: Table S3, along with the minimal CNV frequency after probe exclusion.
Finally, probes with pFisher ≤ 0.001 and a minimum of two disease cases among CNV, duplication, or deletion carriers were retained for assessment through the mirror/U-shaped, duplication-only, or deletion-only model, respectively.The number of probes retained across all models ranged between 0 (sarcoidosis, hyperthyroidism, pituitary dysregulation, rheumatoid arthritis, polycystic kidney disease, and kidney cancer) and 342 (disease burden) (Note S3 -Figure 1C) and correlated with case number of the disease, aligned with the expected gain in power for more frequent diseases (Note S3 -Figure 1D).Number of probes retained according to different models for the main CNV-GWAS are listed in Additional file 2: Table S2.The rationale behind this pre-selection is to reduce computation time without losing any associations, as it is highly unlikely that a genotypic test with p > 0.001 would yield a genome-wide significant (p ≤ 7.5 x 10 -6 ) logistic regression p-value.

Note S3 -Figure 2. Genomic inflation factor of probe genotypic Fisher tests
(A) QQ plots depicting the expected (y-axis) against observed (x-axis) negative logarithm of the genotypic Fisher test's p-values assessing the association strength between the copy-number status of 18,725 probes that passed the CNV frequency filter of ≥ 0.01% and pruning at r 2 > 0.9999 and the 60 diseases and disease burden (top stripe).Data points are expected to follow the dark gray line (95% confidence interval as gray shaded area).Phenotypes are ordered by decreasing genomic inflation factor (l), whose value is indicated in the top left corner.(B) Boxplots of l values across all 61 phenotypes (y-axis) obtained when excluding an increasing percentage (0-80%) of probes with the lowest CNV frequency (x-axis).The red line indicates l = 1 (i.e., no inflation).

A B
Note S4: Post-CNV-GWAS summary statistics processing

Summary statistics harmonization
Given the encoding of CNVs in PLINK (Note S1), we want to obtain the effect of carrying an additional "T" for the mirror (i.e., effect of increasing number of copies), duplication-only (i.e., effect of the duplication), and the deletion-only (i.e., effect of the deletion) models.PLINK selects the effect allele ("A1") as the minor allele, so that depending on the deletion and duplication frequencies, it will report the effect of "A" or "T".In the former case, odds ratios (OR) and their 95% confidence interval (CI) were harmonized to "T", i.e.,  .= / 01 - and  .=  234(01 . )±/.9: * <= /01 (45 -) , respectively.Because we use the hetonly modifier for the Ushape model, PLINK systematically reports the effect of being "AT", i.e., copy-neutral.To instead obtain the effect of having a CNV, the same transformation as described above was applied to all probes.For the disease burden, which was assessed through linear regression,  .= −1 *  > was applied when PLINK reported the effect of "A".Similarly, the confidence interval was multiplied by -1 and inverted, i.e., the lower bound becomes the upper bound and vice-versa.

Conditional analysis
Because of the high correlation between the copy-number state of tested probes, it is important to determine the number of independent CNV-disease associations identified.
Genome-wide significant associations (p ≤ 7.5 x 10 -6 ; Genome-wide significance threshold) were pruned at r 2 > 0.8 (--indep-pairwise 3000 500 0.8 PLINK v2.0).As PLINK preferentially keeps probes with higher nonmajor allele frequencies, we inputted a scaled negative logarithm of association p-value as frequency (--read-freq PLINK v2.0) to instead prioritize probes with the strongest association p-value.For the U-shape model, pruning was performed using custom code by extracting probes from PLINKCNV and re-coding them to match U-shape numerical encoding.Number of independent signals per disease was determined by stepwise conditional analysis.Briefly, for each disease and association model, the CNV genotype of the lead probe (i.e., probe with the most significant association p-value at each iteration; encoding numerically as in Note S1) was included along selected covariates in the logistic regression model and association studies were conducted anew.This process was repeated in an iterative fashion, always including the next lead probe as an additional covariate, until no probes passed the genome-wide significant threshold.

EstBB disease definition
Disease cases and disease burden were defined using the same inclusion and exclusion criteria as for the UKBB, with the notable exceptions of excluding Z12 (routine preventive screens for cancer) and D22-23 (benign skin lesions) subcodes from the exclusion list of cancer traits as due to differences in recording practices, these were much more frequent in the EstBB than in the UKBB, strongly reducing the number of controls.Furthermore, as there are no self-reported diagnoses available in the EstBB, the latter could not be used as an exclusion criterion for disease definition in the EstBB.

CNV calling and sample selection
Autosomal
Disease-relevant covariates were selected among sex, year of birth, genotyping batch

Note S6: Subgrouping of CNV carriers
When analyzing complex CNVRs (i.e., 16p13.11,22q11.2,15q13), CNV carriers were split into subgroups based on visual inspection of breakpoints and segmental duplications overlapping the region.Criteria below were used to define groups (Note S6 -Table 1).CNVs not matching any of the groups are referred to as "atypical" CNVs.Exploring the clinical records of the 12 deletion carriers, we found five diagnoses of breast cancer (a trait assessed by CNV-GWAS but that did not yield a GW-significant association), one of endometrial cancer, and one of Fallopian tube cancer, so that eight carriers (67%) had received a HBOC diagnosis (Note S7 -Figure 1B).Not only was prevalence of HBOC higher among BRCA1 deletion carriers (ORFisher = 31.0;p = 1.1 x 10 -6 ), but disease onset was earlier (HR = 17.0; p = 1.3 x 10 -15 ; Note S7 -Figure 1C).Among the four carriers with no HBOC, two had received cancer prophylactic surgery, de facto reducing the penetrance of the deletion.

CNVR
Surgeries were likely carried out based on family history of HBOC, which was reported for 6 carriers (50%), suggesting that these deletions are inherited.We did not observe higher prevalence of other cancer types (Note S7 -Figure 1B).

Medical history of low-density lipoprotein (LDL) receptor (LDLR) deletion carriers is based on
#41270 (diagnosis -ICD10) and age at diagnosis was calculated as previously described (Case-control definition and age-at-disease onset calculation).Drug usage data originates from #20003 (treatment/medication code).The list of considered hypolipidemic agents and antihypertensive/antianginal drugs (Note S8 -Table 1) was based on: https://www.drugs.com/(accessed: 29/09/2022).A minimum of 3 individuals was required for a code/drug to be displayed.For prevalence and time-to-event analysis, only E78.0 (pure hypercholesterolemia) was considered on the inclusion list; the same exclusion list as for lipidemia was used.Duplications and low-quality CNVs (|QS| ≤ 0.5) were excluded from analyses.Difference in prevalence was assessed with a two-sided Fisher test.Time-to-event analysis was performed as previously described (Statistical confidence tiers) to estimate the effect of the LDLR deletion, using sex, age, age 2 , array, and PC1-40 as covariates.

Category
LDL cholesterol measurements were available for seven LDLR deletion carriers in #42040 (GP clinical event records).LDL levels of earliest measurement on record (primary care) were compared to LDL levels from standardized blood biochemistry measurement (#30780) taken at assessment (#53) using a one-sided paired t-test.P12 was excluded as blood biochemistry LDL levels preluded the first primary care measurement.Based on #42039 (GP prescription records), P5 and P13 were identified as being prescribed statins by their general practitioner despite no record of statin usage in #20003.
We found that deletion of exon 2-6 increased risk for ischemic heart disease (chr19:11,210,904-11,218,188; ORdel = 31.2;95%-CI [7.1; 137.8]; p = 5.6 x 10 -6 ) in a BMIindependent fashion.The condition was present in 8 of 14 deletion carriers (Note S8 -Figure 1A).Heterozygous -and less frequently homozygous -mutations in LDLR represent the main genetic etiology for familial hypercholesterolemia [9], which is characterized by elevated LDL cholesterol and predisposition for adverse cardiovascular outcomes [10].Previously identified in clinical studies of familial hypercholesterolemia [11], the CNVR implicated by our analysis specifically encompasses the ligand-binding domain of LDLR [9].Confirming widespread prevalence and family history (43%) of cardiovascular diseases (Note S8 -Figure 1B), medical records of deletion carriers further revealed higher prevalence (ORFisher = 11.6;p = 7.9 x 10 -5 ) and earlier onset (HR = 5.8; p = 1.4 x 10 -7 ; Note S8 -Figure 1C) of pure hypercholesterolemia (E78.0), a code included in our lipidemia definition but that did not yield a signal pick-up by the CNV-GWAS.As we previously did not find the CNVR to associate with standardized blood biochemistry LDL levels [1], we hypothesized that the latter were lowered by hypolipidemic agents (Note S8 -Table 1).Ten (71%) deletion carriers were on statins and six (43%) were additionally using cholesterol absorption inhibitors, while the remaining four did not receive a dyslipidemia or ischemic heart disease diagnosis and harbored smaller deletions (i.e., P12-14; Note S8 -Figure 1A-B).We concluded that drugs likely masked genetically determined LDL levels, as shown by higher LDL levels in the first primary care measurement on record, measured prior to the standardized LDL measurement (pt-test = 0.03; Note S8 -Figure 1D).Despite this, the recommended target of ≤ 1.8 mmol/L for high-risk individuals [12] was never met.

[ 6 ]
), and v) were included in the EstBB SNP imputation pipeline.CNV outlier samples based on genotyping plate or extreme CNV profile, as well as individuals reporting blood malignancies were excluded, using the same criteria as for the UKBB (Notes S1 and S2).High confidence CNV calls (i.e., with quality score (QS) value of: |QS| > 0.5) of the 156,254 remaining individuals were encoded into three PLINK binary file sets, following the procedure described for the UKBB (Note S1).

( 1 -
11), and PC1-20.For each of the unique 68 autosomal CNVR-disease association signals identified in the UKBB, we identified EstBB probes that were overlapping the CNVR's genomic coordinates.Probes with an EstBB CNV, duplication, or deletion frequency ≥ 0.01%, were retained, depending on whether the mirror/U-shape, duplication-only, or deletion-only was the best UKBB model, respectively, and 11 signals were excluded due to null/low CNV frequency for all probes in the CNVR.Association studies were performed on remaining probes using disease-specific covariates and the best UKBB model, following the previously described procedure.Probes for which the regression failed to converge were discarded, leading to the exclusion of 8 signals for which all regressions failed.Summary statistics of the EstBB probe with the closest genomic location to the lead UKBB probe were retrieved for the remaining 49 signals, setting the replication threshold for significance at p ≤ 0.05/49 = 1.0 x 10 -3 .P-values were adjusted to account for directional concordance with UKBB effects by rewarding and penalizing signals with matching and non-matching effect size signs, respectively.Specifically, one-sided p-values were obtained as  $"? = ) 7*+ @ and  $"? = 1 − ( ) 7*+ @ ) for 35 concordant and 14 non-concordant signals, respectively.One-sided binomial tests (binom.test())were used to assess enrichment of observed versus expected significant replications at various thresholds ( = 0.1 to 0.005 by steps of 0.005), with the R function arguments:  the number of observed signals at ,  the number of testable signals (i.e., 49), and  the expected probability of signals meeting  (i.e., ).

Note S7 -Figure 1 .
BRCA1 deletion association with ovarian and other female cancers (A) Genomic coordinates of the 12 females (P1-12; y-axis) carrying a BRCA1 deletion (CNVR delimited by vertical dashed lines), colored according to ovarian cancer diagnosis.(B) Left: Cancer and related family/personal diagnoses received by individuals in (A).Color indicates age at diagnosis.Right: Counts per ICD-10 code.(C) Kaplan-Meier curve depicting the percentage, with 95% confidence interval, of females free of female-specific cancers over time among copy-neutral and BRCA1 deletion carriers.Hazard ratio (HR) and p-value for the BRCA1 deletion are given (CoxPH model).

Note S8 -Figure 1 .
LDLR deletion association with ischemic heart disease (A) Genomic coordinates of the 14 individuals (P1-14; y-axis) carrying an LDLR deletion (CNVR delimited by vertical dashed lines), colored according to ischemic heart disease (IHD) diagnosis.Sex of the individuals is indicated, with (M) corresponding to male and (F) to female (B) Left: Medical conditions and family/personal diagnoses and medication received by ≥ 3 LDLR deletion carriers in (A).Color indicates age at diagnosis.Right: Counts per ICD-10 code.(C) Kaplan-Meier curve depicting the percentage, with 95% confidence interval, of individuals free of pure hypercholesterolemia (E78.0)among copy-neutral and LDLR deletion carriers.Hazard ratio (HR) and p-value for the LDLR deletion are given (CoxPH model).(D) Low density lipoprotein (LDL)-cholesterol levels (y-axis) from primary care data (first available measurement) and blood biochemistry (average over instances) for six deletion carriers in (A) with at least one antecedent primary care LDL-cholesterol measurement, colored according to IHD diagnosis.P-value compares the two data sources (paired one-sided t-test).Gray horizontal line represents median LDL-cholesterol value (from blood biochemistry) in non-carriers.Light and darker green background represent recommended target values for low (≤ 3 mmol/L) and high (≤ 1.8 mmol/L) risk individuals, respectively.

Note S10 -Figure 1 .
22q11.2 CNV region (A) 22q11.2genetic landscape.Top: Coordinates of duplications (shades of blue; top) and deletions (shades of red; bottom) overlapping the maximal CNV region (CNVR delimited by vertical dashed lines) associated with ischemic heart disease (IHD), headaches, and aneurysm.CNVs are divided and colored according to four groups to reflect breakpoints at low-copy repeats (LCRs) spanning the region: A-D, A-B, B-D, C-D, with atypical CNVs in gray (Note S6).LCRs are composed of segmental duplications, represented as a gray gradient proportional to the degree of similarity.Genomic coordinates of genes and DECIPHER GD are displayed.Bottom: Negative logarithm of association p-values of CNVs (best model in parenthesis) with IHD, headaches, and aneurysm.Disease-specific CNVRs are shown with colored vertical dashed lines.Red horizontal dashed line represents the genome-wide threshold for significance for CNV-GWAS (p ≤ 7.5 x 10 -6 ).(B) Prevalence of aneurysm according to 22q11.2 copynumber (CN) and CNV group (A).P-values compare deletion (CN = 1) and duplication (CN = 3) carriers from various groups (other = A-D, A-B, C-D; all = A-D, A-B, B-D, C-D) to copy-neutral (CN = 2) individuals (two-sided Fisher test).(C) Prevalence of IHD according to CNV groups (A).P-values compare IHD prevalence among individuals carrying a CNV (duplication or deletion) spanning LCR C-D, B-D, or A-D to copy-neutral (CN = 2) individuals (two-sided Fisher test).(B, C) Error bars represent ± standard error; number of cases and sample sizes are indicated (N = cases/sample size).